Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Yiqiao Jin, Mohit Chandra, Gaurav Verma, Yibo Hu, Munmun De Choudhury, Srijan Kumar
2023-10-24
Abstract: Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stake domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in the context of non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multi-lingual dialogue systems for healthcare queries. Our empirically-derived framework XlingEval focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages, including English, Spanish, Chinese, and Hindi, spanning three expert-annotated large health Q&A datasets, and through an amalgamation of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XlingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models, and to provide an equitable information ecosystem accessible to all.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses two main problems:

1. **Ensuring the safety of large language models (LLMs) in high-risk areas, especially healthcare**
   - LLMs are widely used in critical areas such as healthcare, but the information they provide may be inaccurate or incomplete, which can seriously harm users. It is therefore crucial to assess the correctness, consistency, and verifiability of LLM answers to medical queries.
   - Specifically, the paper proposes three evaluation criteria:
     - **Correctness**: The model's answers should be factually accurate and should comprehensively answer the query.
     - **Consistency**: The model should generate similar answers to the same query, with a high degree of lexical, semantic, and topical similarity.
     - **Verifiability**: The model should be able to verify the correctness of answers and clearly distinguish between correct and incorrect ones.
2. **Addressing language disparities and ensuring fairness in multilingual settings**
   - The development and evaluation of most LLMs currently focus on English, which may lead to a poorer experience for non-English users, especially given that more than 80% of the global population does not speak English as a native or second language.
   - To assess fairness across users of different languages, the paper proposes a cross-lingual evaluation framework (XLingEval) and conducts experiments in four major languages (English, Spanish, Chinese, and Hindi), revealing significant differences in LLM performance across languages.

By studying these two aspects, the paper aims to promote the safety and fairness of LLMs in the healthcare field, so that users of all languages can obtain reliable information.
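To make the consistency criterion concrete, here is a minimal sketch of one way to score lexical similarity across repeated answers to the same query. It is an illustration only, not the paper's actual metrics: the function names (`jaccard`, `lexical_consistency`), the whitespace tokenization, and the choice of Jaccard overlap are all assumptions made for this example.

```python
from itertools import combinations


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization. This is too crude
    # for languages such as Chinese, which need a proper word segmenter.
    return set(text.lower().split())


def jaccard(a, b):
    # Jaccard overlap between the token sets of two responses (0.0 to 1.0).
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty responses are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def lexical_consistency(responses):
    # Hypothetical helper: mean pairwise Jaccard similarity over all
    # responses generated for a single query.
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    answers = [
        "Drink water and rest",
        "Rest and drink water",
        "See a doctor immediately",
    ]
    print(round(lexical_consistency(answers), 3))
```

Under these assumptions, comparing the score for the same query asked in different languages would give a rough view of the lexical side of consistency; semantic and topical similarity would need embedding-based or topic-model-based measures instead.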