Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Yiqiao Jin, Mohit Chandra, Gaurav Verma, Yibo Hu, Munmun De Choudhury, Srijan Kumar

2023-10-24

Abstract: Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stake domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically-derived framework XlingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages, including English, Spanish, Chinese, and Hindi, spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XlingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models, and to provide an equitable information ecosystem accessible to all.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses two main problems:

1. **Ensuring the safety of large language models (LLMs) in high-risk domains, especially healthcare**
   - LLMs are widely used in critical areas such as healthcare, but the information they provide may be inaccurate or incomplete, which can seriously harm users. It is therefore crucial to assess the correctness, consistency, and verifiability of LLM answers to medical queries.
   - The paper proposes three evaluation criteria:
     - **Correctness**: The model's answers should be factually accurate and should answer the query comprehensively.
     - **Consistency**: The model should generate similar answers to the same query, with a high degree of similarity in vocabulary, semantics, and topic.
     - **Verifiability**: The model should be able to verify the correctness of answers and clearly distinguish correct answers from incorrect ones.
2. **Addressing language disparities and ensuring fairness in multilingual settings**
   - The development and evaluation of most LLMs focus mainly on English, which may lead to a poorer experience for non-English users, especially given that more than 80% of the global population does not use English as a native or second language.
   - To examine fairness across users of different languages, the paper proposes a cross-lingual evaluation framework (XlingEval) and conducts experiments in four major languages (English, Spanish, Chinese, and Hindi), revealing significant differences in LLM performance across languages.

By studying these two aspects, the paper aims to promote the safety and fairness of LLMs in healthcare and to ensure that users of all languages can obtain reliable information.
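As an illustration of the consistency criterion described above, the following minimal sketch scores how lexically similar several responses to the same query are, using the mean pairwise Jaccard overlap of their token sets. This is not the paper's metric: the function names (`token_set`, `jaccard`, `lexical_consistency`) and the choice of Jaccard similarity are assumptions made for illustration, and the whitespace tokenizer is a simplification that would not segment Chinese text properly.

```python
from itertools import combinations


def token_set(text: str) -> set[str]:
    """Lowercase and split on whitespace (hypothetical, simplified tokenizer)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two responses."""
    sa, sb = token_set(a), token_set(b)
    if not sa and not sb:
        return 1.0  # two empty responses are treated as identical
    return len(sa & sb) / len(sa | sb)


def lexical_consistency(responses: list[str]) -> float:
    """Mean pairwise Jaccard similarity over all pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A score near 1 would indicate that repeated answers to one query share most of their vocabulary; comparing such scores across languages is one way a disparity of the kind the paper reports could be surfaced. Semantic and topic-level similarity would need different measures than this purely lexical one.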

