Abstract: Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages, primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura, a new benchmark that focuses on two tasks in six typologically diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges of creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models, such as GPT-4o, o1-preview, and the Claude models, and open-source models, such as Meta's LLaMA and Google's Gemma. Additionally, all models perform better in English than in African languages. These results indicate that LLMs struggle with answering scientific questions and are more prone to generating false claims in low-resource African languages. Our findings underscore the necessity for continuous improvement of multilingual LLM capabilities in LRL settings to ensure safe and reliable use in real-world contexts. We open-source the Uhura Benchmark and Uhura Platform to foster further research and development in NLP for LRLs.

Y-NQ: English-Yorùbá Evaluation dataset for Open-Book Reading Comprehension and Text Generation

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects

Yankari: A Monolingual Yoruba Dataset

NaijaRC: A Multi-choice Reading Comprehension Dataset for Nigerian Languages

OMGEval: an Open Multilingual Generative Evaluation Benchmark for Large Language Models

IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models

The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants

Investigating a Benchmark for Training-set free Evaluation of Linguistic Capabilities in Machine Reading Comprehension

Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains

ELQA: A Corpus of Metalinguistic Questions and Answers about English

Cross-lingual Open-Retrieval Question Answering for African Languages

XQA: A Cross-lingual Open-domain Question Answering Dataset

QuALITY: Question Answering with Long Input Texts, Yes!

Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension

FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages

Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination

Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

Improving Yorùbá Diacritic Restoration