Abstract: In this work, we present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark, and investigate cross-lingual transfer in acceptability judgements with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntax capability acquisition. Our results show that GPT-4o exhibits a strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples lead to 23 MCC performance in a completely unrelated language -- Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks. Our data is available at https://github.com/sjtu-compling/MELA.

What problem does this paper attempt to address?

The paper constructs MELA (Multilingual Evaluation of Linguistic Acceptability), a large-scale multilingual benchmark for systematically evaluating how well multilingual models judge grammatical acceptability. Specifically, the paper aims to:

1. **Fill the gap in multilingual benchmarks**: Most existing linguistic acceptability datasets focus on English, and a benchmark covering many languages has been missing. MELA covers 10 languages from a diverse set of language families and provides 46K samples.
2. **Evaluate the capabilities of multilingual models**: With MELA, researchers can systematically evaluate multilingual models (such as XLM-R, BLOOMZ, mTk, mT0, Baichuan2-Chat, GPT-3.5 and GPT-4) on acceptability judgment, and in particular measure how well these models transfer across languages.
3. **Explore the acquisition of grammatical ability**: By fine-tuning models on MELA, researchers can examine whether performance on grammar-related tasks improves and how the models acquire grammatical knowledge from acceptability judgment tasks.

### Main contributions

1. **Construct the MELA benchmark**: MELA is the largest linguistic acceptability benchmark to date, covering 10 languages from a diverse set of language families.
2. **Evaluate multilingual models**: The paper benchmarks multiple multilingual models and finds that GPT-4 performs strongly in zero-shot and few-shot settings, outperforming other models especially in low-resource languages.
3. **Cross-lingual transfer research**: The experiments show that cross-lingual transfer is possible even between completely unrelated languages. For example, fine-tuning on only 500 Icelandic samples yields an MCC of 23.16 on Chinese.
4. **Probing experiments on grammatical ability**: Probing XLM-R after fine-tuning on MELA shows that fine-tuning improves the model's performance on grammar-related tasks, such as dependency relation tagging and syntactic tree structure tagging.

### Conclusion

MELA provides a tool for evaluating multilingual models and a new perspective for studying their grammatical ability and cross-lingual transfer. With this benchmark, researchers can better understand the performance and limitations of multilingual models across different languages.
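
The scores quoted here (for example, 23.16 for Icelandic-to-Chinese transfer) are Matthews correlation coefficients, presumably reported on a 0 to 100 scale since the raw coefficient lies in [-1, 1]. As a minimal sketch of how such a score can be computed for binary acceptability labels, the following uses the standard MCC formula; the function name `mcc` and the toy labels are illustrative assumptions, not code or data from the paper.

```python
import math

def mcc(gold, pred):
    """Matthews correlation coefficient for binary labels.

    Labels are assumed to be 1 (acceptable) or 0 (unacceptable).
    Returns 0.0 when the denominator is zero, a common convention.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Toy example (hypothetical labels): 8 sentences, 4 acceptable and 4 not.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]   # one miss in each class
always_acceptable = [1] * len(gold)  # trivial majority-style baseline

print(round(100 * mcc(gold, pred), 2))               # 50.0
print(round(100 * mcc(gold, always_acceptable), 2))  # 0.0
```

A classifier that labels every sentence acceptable scores 0 under MCC even though its accuracy here would be 50%, which is one reason MCC is a common metric for acceptability judgment, where label distributions are often skewed.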

MELA: Multilingual Evaluation of Linguistic Acceptability
