MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens

Yongqi Fan, Hongli Sun, Kui Xue, Xiaofan Zhang, Shaoting Zhang, Tong Ruan

2024-06-21

Abstract: Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to the unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, evaluation benchmarks of long-context capabilities for LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context "needles in a haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as counter-intuitive reasoning and novel (unknown) facts injection to mitigate knowledge leakage and data contamination of LLMs. The second component confronts the challenge of requiring professional medical expertise. In particular, we design the "Maximum Identical Context" principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiment evaluates advanced proprietary and open-source LLMs tailored for processing long contexts and presents detailed performance analyses. The results highlight that LLMs still face challenges in this area and that further research is needed. Our code and data are released in the repository: https://github.com/JOHNNY-fans/MedOdyssey

Computation and Language

What problem does this paper attempt to address?

The paper addresses the evaluation of large language models (LLMs) on long-context tasks in the medical domain. Many advanced LLMs now support context lengths of 128K or even 200K tokens, and long-context benchmarks already exist for general domains, but comparable benchmarks remain scarce in medicine. The paper introduces MedOdyssey, the first long-context benchmark for the medical domain, which covers seven length levels (from 4K to 200K tokens) and consists of two main components:

1. **Medical-context "needles in a haystack" task**: Knowledge fragments (needles) unrelated to the surrounding text are inserted into long texts, and the LLM must answer questions about them. This tests whether the model can locate and use specific information inside a long context.
2. **A series of medical application tasks**: These include term normalization, question answering based on medical knowledge graphs, table-based question answering, and case-based question answering.

To improve fairness, the researchers introduce the "Maximum Identical Context" principle, which guarantees that different models observe as many identical contexts as possible. The paper also uses counter-intuitive reasoning and novel facts injection to mitigate knowledge leakage and data contamination.

Experiments on a range of advanced proprietary and open-source LLMs show that existing models still struggle with long medical texts, especially with formatted output and complex reasoning. The paper therefore argues that further research is needed to improve LLM performance in long-context medical scenarios.
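As an illustration of the "needles in a haystack" setup described above, the following is a minimal sketch, not the authors' implementation. It assumes whitespace-separated words stand in for tokens and that every context length is cut from the same shared prefix of one haystack, which is one simple way to approximate the idea that different models see identical contexts. The function names `build_context` and `contains_answer`, the toy haystack, and the needle text are all hypothetical.

```python
from typing import List


def build_context(haystack_tokens: List[str], needle: str,
                  length: int, depth: float) -> str:
    """Cut a shared haystack to `length` tokens and insert `needle` at `depth`.

    `depth` is a fraction in [0, 1]: 0 puts the needle at the start,
    1 puts it at the end. Tokens are whitespace-separated words here,
    which is an assumption; a real setup would use a model tokenizer.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    needle_tokens = needle.split()
    # Reserve room for the needle so the final context has exactly `length` tokens.
    budget = length - len(needle_tokens)
    if budget < 0 or budget > len(haystack_tokens):
        raise ValueError("length is incompatible with haystack or needle size")
    base = haystack_tokens[:budget]  # same shared prefix for every model
    cut = int(depth * budget)        # insertion index inside the prefix
    return " ".join(base[:cut] + needle_tokens + base[cut:])


def contains_answer(model_output: str, expected: str) -> bool:
    """Crude scoring: case-insensitive substring match on the expected answer."""
    return expected.lower() in model_output.lower()


# Toy data: a 100-word haystack and a made-up needle.
haystack = [f"w{i}" for i in range(100)]
needle = "The secret code is 7421"
ctx = build_context(haystack, needle, length=20, depth=0.5)
```

In this sketch the needle always counts against the length budget, so contexts built for the same length level have the same size regardless of needle depth. A real evaluation would send `ctx` plus a question to each model and score the reply, for example with `contains_answer(reply, "7421")`.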

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens

NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?

A Benchmark for Long-Form Medical Question Answering

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios

Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

Towards Evaluating and Building Versatile Large Language Models for Medicine

Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA

MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

Fine-Tuning Medical Language Models for Enhanced Long-Contextual Understanding and Domain Expertise

Large Language Model Benchmarks in Medical Tasks

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models

XL$^2$Bench: A Benchmark for Extremely Long Context Understanding with Long-range Dependencies

LongBoX: Evaluating Transformers on Long-Sequence Clinical Tasks

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

CMB: A Comprehensive Medical Benchmark in Chinese

Marathon: A Race Through the Realm of Long Context with Large Language Models

LooGLE: Can Long-Context Language Models Understand Long Contexts?

RULER: What's the Real Context Size of Your Long-Context Language Models?