Abstract: Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low- and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-Augmented Generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method leveraging retrieval-augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.
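Because the benchmark questions are parallel across languages, per-language accuracies can be compared directly. The following is a minimal Python sketch of one way to quantify such a comparison. It assumes the gap is the best per-language accuracy minus the worst, which may differ from the metric the paper actually uses; the function names and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Compute accuracy per language.

    records: iterable of (question_id, language, is_correct) tuples,
    where the same question_id appears once per language.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for _question_id, language, is_correct in records:
        total[language] += 1
        correct[language] += int(bool(is_correct))
    return {language: correct[language] / total[language] for language in total}


def accuracy_gap(accuracy_by_language):
    """Assumed gap definition: best language accuracy minus worst."""
    values = list(accuracy_by_language.values())
    return max(values) - min(values)


# Made-up outcomes for two parallel questions in three languages.
toy_records = [
    ("q1", "English", True), ("q2", "English", True),
    ("q1", "Filipino", True), ("q2", "Filipino", False),
    ("q1", "Hindi", False), ("q2", "Hindi", False),
]

acc = per_language_accuracy(toy_records)
gap = accuracy_gap(acc)
print(acc, gap)
```

A smaller gap at equal or higher mean accuracy would correspond to the kind of improvement the abstract describes.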

What problem does this paper attempt to address?

The paper addresses the performance differences and biases of large language models (LLMs) in multilingual ophthalmology question-answering tasks, especially their weaker performance in languages used in low- and middle-income countries (LMICs). Specifically:

1. **Performance gap caused by language differences**: Existing LLMs show significant performance differences across languages in natural-language question-answering tasks, especially for languages spoken in LMICs, such as Filipino and Hindi. This gap may make it harder for these countries to use LLMs for medical assistance.
2. **Lack of professional knowledge in the medical field**: LLMs struggle with complex clinical and surgical ophthalmology questions and lack sufficient domain knowledge. These problems are more pronounced in languages spoken in LMICs.
3. **Limitations of existing debiasing methods**: Existing debiasing methods such as Translation Chain-of-Thought or Retrieval-Augmented Generation (RAG) have limited effectiveness in improving multilingual performance, fail to close the cross-language performance gap, and are not tailored to the medical field.

To solve these problems, the authors propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method. CLARA combines multi-agent collaborative mechanisms such as translation, evaluation, knowledge enhancement, and self-validation. It aims to improve the accuracy and fairness of LLMs in multilingual ophthalmology question-answering tasks and to reduce the cross-language performance gap, thereby promoting more balanced AI medical applications worldwide.

### Main contributions

1. **Constructed the first multilingual ophthalmology question-answering benchmark dataset**: Multi-OphthaLingua contains 1,184 questions covering English, Spanish, Filipino, Portuguese, Chinese, French, and Hindi, supporting direct cross-language comparison.
2. **In-depth analysis of why LLMs fail in multilingual ophthalmology question answering**: Quantitative and qualitative analyses reveal how LLM performance differs across languages and which factors lie behind those differences.
3. **Proposed a new debiasing method, CLARA**: Extensive experiments show that CLARA can significantly improve the accuracy and fairness of multilingual ophthalmology question answering.

These contributions help advance the medical application of LLMs in multilingual environments, in particular by providing more reliable medical assistance tools for low- and middle-income countries.
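The mechanisms named above (translation, knowledge enhancement through retrieval, and self-validation) could be wired together roughly as in the Python sketch below. This is an illustration under stated assumptions, not the authors' implementation: the `Agents` container, the function `answer_with_verification`, the ordering of the steps, and the retry limit are all hypothetical, and the stub agents stand in for real LLM and retrieval calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agents:
    # Each field is a hypothetical stand-in for an LLM or retrieval call.
    translate: Callable[[str, str], str]            # (text, source language) -> English text
    retrieve: Callable[[str], List[str]]            # question -> supporting passages
    answer: Callable[[str, List[str]], str]         # (question, passages) -> answer option
    verify: Callable[[str, str, List[str]], bool]   # (question, answer, passages) -> accepted?


def answer_with_verification(question, language, agents, max_rounds=2):
    """Assumed control flow: translate, answer, self-verify, retrieve on rejection."""
    english = question if language == "English" else agents.translate(question, language)
    passages: List[str] = []
    candidate = agents.answer(english, passages)
    for _ in range(max_rounds):
        if agents.verify(english, candidate, passages):
            return candidate
        # Verification failed: add retrieved knowledge and answer again.
        passages = agents.retrieve(english)
        candidate = agents.answer(english, passages)
    return candidate


# Toy stub agents so the sketch runs without any model or index.
stub = Agents(
    translate=lambda text, lang: "translated: " + text,
    retrieve=lambda q: ["passage about " + q],
    answer=lambda q, passages: "B" if passages else "A",
    verify=lambda q, a, passages: a == "B",
)

result = answer_with_verification("Pregunta de ejemplo", "Spanish", stub)
print(result)
```

Passing the agents in as plain callables keeps the control flow separate from any particular model or retriever, so each step can be swapped or tested in isolation.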

Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs
