METAL: Towards Multilingual Meta-Evaluation

Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram
2024-04-02
Abstract: With the rising human-like precision of Large Language Models (LLMs) in numerous tasks, their utilization in a variety of real-world applications is becoming more prevalent. Several studies have shown that LLMs excel on many standard NLP benchmarks. However, it is challenging to evaluate LLMs due to test dataset contamination and the limitations of traditional metrics. Since human evaluations are difficult to collect, there is a growing interest in the community to use LLMs themselves as reference-free evaluators for subjective metrics. However, past work has shown that LLM-based evaluators can exhibit bias and have poor alignment with human judgments. In this study, we propose a framework for an end-to-end assessment of LLMs as evaluators in multilingual scenarios. We create a carefully curated dataset, covering 10 languages containing native speaker judgments for the task of summarization. This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL). We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2. Our results indicate that LLM-based evaluators based on GPT-4 perform the best across languages, while GPT-3.5-Turbo performs poorly. Additionally, we perform an analysis of the reasoning provided by LLM-based evaluators and find that it often does not match the reasoning provided by human judges.
Computation and Language
What problem does this paper attempt to address?
The paper addresses how to assess large language models (LLMs) when they are themselves used as evaluators in multilingual scenarios. Existing evaluation methods suffer from test dataset contamination, the limitations of traditional metrics, and a lack of multilingual resources, and human evaluations are difficult to collect. The paper proposes an end-to-end framework for meta-evaluation, that is, for evaluating the LLM-based evaluators rather than the summaries alone. It builds METAL, a carefully curated dataset covering 10 languages and multiple evaluation dimensions, with native speaker judgments for the summarization task, and compares the scores of LLM-based evaluators built on GPT-3.5-Turbo, GPT-4, and PaLM2 against those human judgments. The findings indicate that GPT-4-based evaluators perform best across languages, while GPT-3.5-Turbo performs poorly, and that the reasoning the evaluators provide often does not match the reasoning given by human judges. This suggests that LLMs still have room to improve as evaluators in multilingual settings.
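To make the idea of meta-evaluation concrete, the sketch below compares an LLM evaluator's labels with human labels and reports per-language agreement. It is only an illustration: the text above does not say which agreement measure the paper uses, so simple percent agreement is an assumption here, and the function name `percent_agreement`, the language codes, and the toy scores are all hypothetical.

```python
from collections import defaultdict


def percent_agreement(records):
    """Return the fraction of items per language where the LLM label
    equals the human label.

    `records` is an iterable of (language, human_label, llm_label) tuples.
    Percent agreement is an assumed measure, chosen for simplicity.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for language, human_label, llm_label in records:
        totals[language] += 1
        if human_label == llm_label:
            hits[language] += 1
    return {language: hits[language] / totals[language] for language in totals}


# Hypothetical toy data: placeholder language codes and made-up scores.
toy_records = [
    ("lang_a", 2, 2),
    ("lang_a", 1, 2),
    ("lang_b", 0, 0),
    ("lang_b", 2, 2),
]

agreement = percent_agreement(toy_records)
for language, score in sorted(agreement.items()):
    print(f"{language}: {score:.2f}")
```

Percent agreement ignores chance agreement and the ordering of scores, so a chance-corrected or rank-based statistic may be more appropriate in practice.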