LongHealth: A Question Answering Benchmark with Long Clinical Documents

Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo JWL. Aerts, Jakob Nikolas Kather, Daniel Truhn, Keno Bressem
2024-01-26
Abstract: Background: Recent advancements in large language models (LLMs) offer potential benefits in healthcare, particularly in processing extensive patient records. However, existing benchmarks do not fully assess LLMs' capability in handling real-world, lengthy clinical data.
Computation and Language
What problem does this paper attempt to address?
This paper introduces LongHealth, a benchmark for evaluating how well large language models (LLMs) process long clinical documents. Existing medical benchmarks do not fully test LLMs on realistic clinical data, especially complex and lengthy medical records. LongHealth consists of 20 fictional but structurally realistic patient cases, each composed of multiple discharge summaries totaling between 5,090 and 6,754 words. It tests information extraction, negation handling, and sorting through 400 multiple-choice questions.

LLMs could improve healthcare efficiency, for example by quickly scanning large amounts of data and extracting key information. Nine open-source LLMs and GPT-3.5 Turbo were evaluated, and Mixtral-8x7B-Instruct-v0.1 performed best on information retrieval tasks. However, all models struggled on questions that require identifying that the requested information is missing.

The paper concludes that, although LLMs show potential for processing lengthy clinical documents, their current accuracy is not sufficient for reliable clinical use, particularly where missing information must be identified. LongHealth offers a more realistic assessment of how LLMs perform in a medical setting and underscores the need for further model improvements before safe and effective clinical application.
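To make the evaluation setup concrete, the following is a minimal sketch of how answers to multiple-choice questions might be scored per question category. It is not the authors' evaluation code: the `Question` class, the category labels, and the `accuracy_by_category` function are hypothetical names introduced here for illustration, and the sample data is invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Question:
    """Hypothetical record for one multiple-choice question."""
    category: str  # assumed labels, e.g. "extraction", "negation", "sorting"
    correct: str   # letter of the correct option, e.g. "A"


def accuracy_by_category(questions, predictions):
    """Return the fraction of correct answers for each question category."""
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        totals[question.category] += 1
        # Normalize the model's answer letter before comparing.
        if predicted.strip().upper() == question.correct.upper():
            hits[question.category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Invented sample data, for illustration only.
sample_questions = [
    Question("extraction", "A"),
    Question("extraction", "B"),
    Question("negation", "C"),
    Question("sorting", "D"),
]
sample_predictions = ["a", "C", "C", " d "]
scores = accuracy_by_category(sample_questions, sample_predictions)
```

Reporting accuracy per category rather than as a single number keeps weaknesses visible, such as the difficulty with missing information that the summary describes.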