OMGEval: an Open Multilingual Generative Evaluation Benchmark for Large Language Models
Yang Liu, Meng Xu, Shuo Wang, Liner Yang, Haoyu Wang, Zhenghao Liu, Cunliang Kong, Yun Chen, Maosong Sun, Erhong Yang
DOI: https://doi.org/10.48550/arxiv.2402.13524
2024-01-01
Abstract: Modern large language models (LLMs) should generally benefit individuals from various cultural backgrounds around the world. However, most recent advanced generative evaluation benchmarks tailored for LLMs mainly focus on English. To address this gap, we introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages. For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs, such as general knowledge and logical reasoning. Each question is rigorously verified by human annotators. Notably, to sufficiently reflect the compatibility of LLMs in different cultural backgrounds, we perform localization for each non-English language. Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar). Following AlpacaEval, we employ GPT-4 as the adjudicator to automatically score different model outputs, an approach that has been shown to be closely related to human evaluation. We evaluate several representative multilingual LLMs on the proposed OMGEval, which we believe will provide a valuable reference for the community to further understand and improve the multilingual capability of LLMs. OMGEval is available at https://github.com/blcuicall/OMGEval.
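
As an illustration of how adjudicator verdicts might be aggregated in an AlpacaEval-style setup, the following minimal sketch computes a per-language win rate from pairwise judgments. It is not taken from the OMGEval code base: the function name `win_rates`, the verdict labels (`"model"`, `"baseline"`, `"tie"`), and the choice to count a tie as half a win are all assumptions made for this example, and the toy judgments are invented placeholders rather than real results.

```python
from collections import defaultdict

# Hypothetical verdict labels; the real adjudicator output format may differ.
VALID_VERDICTS = {"model", "baseline", "tie"}


def win_rates(judgments):
    """Aggregate pairwise verdicts into a win rate per language.

    `judgments` is an iterable of (language, verdict) pairs, where verdict
    says whether the adjudicator preferred the evaluated model's answer,
    the baseline's answer, or neither. Ties count as half a win (assumption).
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for language, verdict in judgments:
        if verdict not in VALID_VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
        totals[language] += 1
        if verdict == "model":
            wins[language] += 1.0
        elif verdict == "tie":
            wins[language] += 0.5
    return {language: wins[language] / totals[language] for language in totals}


# Toy, made-up judgments purely to exercise the function.
toy_judgments = [
    ("Zh", "model"),
    ("Zh", "baseline"),
    ("Fr", "model"),
    ("Fr", "tie"),
]
rates = win_rates(toy_judgments)
```

In a full run, each language would contribute one judgment per question (804 per language, per the abstract), so the denominators would be equal across languages.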