MELA: Multilingual Evaluation of Linguistic Acceptability

Ziyin Zhang, Yikang Liu, Weifang Huang, Junyu Mao, Rui Wang, Hai Hu
2024-06-06
Abstract: In this work, we present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark, and investigate cross-lingual transfer in acceptability judgements with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntax capability acquisition. Our results show that GPT-4o exhibits a strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples lead to 23 MCC performance in a completely unrelated language -- Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks. Our data is available at https://github.com/sjtu-compling/MELA.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper sets out to build MELA (Multilingual Evaluation of Linguistic Acceptability), a large-scale multilingual benchmark for systematically evaluating how well multilingual models judge grammatical acceptability. Specifically, it aims to:

1. **Fill the gap in multilingual benchmarks**: Most existing linguistic acceptability datasets focus on English, and benchmarks that cover several languages are scarce. MELA covers 10 languages from a diverse set of language families and provides 46K samples.
2. **Evaluate the capabilities of multilingual models**: MELA lets researchers systematically compare multilingual models (such as XLM-R, BLOOMZ, mTk, mT0, Baichuan2-Chat, GPT-3.5 and GPT-4) on grammatical acceptability judgment, including how well that ability transfers from one language to another.
3. **Explore the acquisition of grammatical ability**: By fine-tuning models on MELA, researchers can examine whether training on acceptability judgments improves performance on grammar-related tasks, and how the models acquire grammatical knowledge in the process.

### Main contributions

1. **Construct the MELA benchmark**: MELA is the largest linguistic acceptability benchmark to date, covering 10 languages from a diverse set of language families.
2. **Evaluate multilingual models**: The paper benchmarks multiple multilingual models and finds that GPT-4 performs strongly in zero-shot and few-shot settings, outperforming the other models particularly in low-resource languages.
3. **Cross-language transfer research**: The experiments show that cross-language transfer is possible even between completely unrelated languages. For example, XLM-R fine-tuned on only 500 Icelandic samples reaches an MCC of 23.16 on Chinese.
4. **Probe experiments on grammatical ability**: Probing XLM-R after fine-tuning on MELA shows that fine-tuning improves the model's performance on grammar-related tasks such as dependency relation tagging and syntactic tree structure tagging.

### Conclusion

MELA provides a tool for evaluating multilingual models and a new angle for studying their grammatical ability and cross-language transfer. With this benchmark, researchers can better understand how multilingual models perform, and where they fall short, across different languages.
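
The cross-lingual transfer result above is reported in MCC (Matthews correlation coefficient), which is computed from the confusion matrix of binary acceptability labels. The following is a minimal sketch of that computation, not the authors' evaluation code. The toy labels are made up for illustration, and the multiplication by 100 is an assumption about how a score like 23.16 is scaled.

```python
import math


def matthews_corrcoef(gold, pred):
    """MCC for binary labels (1 = acceptable, 0 = unacceptable)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is taken to be 0 when any marginal count is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical gold labels and model predictions for eight sentences.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 0, 1, 1, 0]

score = matthews_corrcoef(gold, pred)
print(round(100 * score, 2))  # assumed scaling: MCC x 100
```

MCC ranges from -1 to 1, and a model that always predicts the majority class scores 0. This makes it more informative than accuracy when acceptable and unacceptable sentences are imbalanced.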