Abstract: Large language models (LLMs) are transforming the ways the general public accesses and consumes information. Their influence is particularly pronounced in pivotal sectors like healthcare, where lay individuals are increasingly appropriating LLMs as conversational agents for everyday queries. While LLMs demonstrate impressive language understanding and generation proficiencies, concerns regarding their safety remain paramount in these high-stakes domains. Moreover, the development of LLMs is disproportionately focused on English. It remains unclear how these LLMs perform in non-English languages, a gap that is critical for ensuring equity in the real-world use of these systems. This paper provides a framework to investigate the effectiveness of LLMs as multilingual dialogue systems for healthcare queries. Our empirically derived framework, XlingEval, focuses on three fundamental criteria for evaluating LLM responses to naturalistic human-authored health-related questions: correctness, consistency, and verifiability. Through extensive experiments on four major global languages (English, Spanish, Chinese, and Hindi), spanning three expert-annotated large health Q&A datasets, and through a combination of algorithmic and human-evaluation strategies, we found a pronounced disparity in LLM responses across these languages, indicating a need for enhanced cross-lingual capabilities. We further propose XlingHealth, a cross-lingual benchmark for examining the multilingual capabilities of LLMs in the healthcare context. Our findings underscore the pressing need to bolster the cross-lingual capacities of these models and to provide an equitable information ecosystem accessible to all.

Since the Scientific Literature Is Multilingual, Our Models Should Be Too

The Less the Merrier? Investigating Language Representation in Multilingual Models

Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?

Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models

Do Large Language Models Have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs

The Roles of English in Evaluating Multilingual Language Models

Fairness in Language Models Beyond English: Gaps and Challenges

An Interdisciplinary Outlook on Large Language Models for Scientific Research

Towards Efficient Large Language Models for Scientific Text: A Review

How should the advent of large language models affect the practice of science?

Could We Have Had Better Multilingual LLMs If English Was Not the Central Language?

Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts

Better to Ask in English: Cross-Lingual Evaluation of Large Language Models for Healthcare Queries

Beyond English-Centric LLMs: What Language Do Multilingual Language Models Think in?

Overcoming Language Barriers in Academia: Machine Translation Tools and a Vision for a Multilingual Future

Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models

Double Jeopardy and Climate Impact in the Use of Large Language Models: Socio-economic Disparities and Reduced Utility for Non-English Speakers

Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing

A global AI community requires language-diverse publishing

Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance