MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong
2024-01-30
Abstract: Large language models (LLMs) are increasingly relied upon for complex multi-turn conversations across diverse real-world applications. However, existing benchmarks predominantly focus on single-turn evaluations, overlooking the models' capabilities in multi-turn interactions. To address this gap, we introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. We construct multi-turn queries for each category either by augmenting existing datasets or by creating new examples with GPT-4 to avoid data leakage. To study the factors impacting multi-turn abilities, we create single-turn versions of the 1170 multi-turn queries and compare performance. Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks. We observe significant performance degradation in multi-turn settings compared to single-turn settings in most models, which is not correlated with the models' fundamental capabilities. Moreover, we identify the distance to relevant content and susceptibility to error propagation as the key factors influencing multi-turn performance. MT-Eval is released publicly to encourage future research towards more robust conversational models.
Computation and Language
What problem does this paper attempt to address?
The paper addresses a gap in LLM evaluation: existing benchmarks focus mainly on single-turn evaluation and neglect how large language models (LLMs) perform in multi-turn dialogues. Although frameworks such as MMLU and MT-Bench cover language understanding and some aspects of multi-turn dialogue, they do not comprehensively cover the types and complexity of real-world multi-turn conversations.

To address this, the paper proposes MT-Eval, a benchmark for evaluating multi-turn dialogue capabilities. By analyzing human-LLM interactions, it categorizes interaction patterns into four types: Recollection, Expansion, Refinement, and Follow-up. Each type has its own evaluation set, designed to cover real-world application scenarios.

The main contributions of MT-Eval are:
1. Proposing a comprehensive multi-turn dialogue evaluation benchmark that covers a wide range of real-world scenarios.
2. Conducting an in-depth performance analysis of 11 popular LLMs, providing insights into their capabilities in multi-turn dialogues.
3. Identifying key factors that affect multi-turn performance, namely the distance to relevant content and susceptibility to error propagation.
4. Showing why LLMs should be evaluated in multi-turn settings, where performance can differ from what single-turn evaluation suggests.

Through this work, the paper aims to drive future research towards more robust dialogue models.
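
The single-turn versus multi-turn comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `degradation_by_category`, the record layout, and the numeric scores are all hypothetical, and the scores are made-up illustration values rather than results from the paper. The sketch only shows the idea of averaging scores per interaction type and taking the single-turn minus multi-turn gap.

```python
from collections import defaultdict
from statistics import mean

# The four interaction types named in the abstract.
CATEGORIES = ("recollection", "expansion", "refinement", "follow-up")


def degradation_by_category(records):
    """Return mean single-turn score minus mean multi-turn score per category.

    `records` is an iterable of (category, single_turn_score, multi_turn_score)
    tuples. This layout is an assumption made for illustration only.
    A positive gap means the model scored lower in the multi-turn setting.
    """
    single = defaultdict(list)
    multi = defaultdict(list)
    for category, single_score, multi_score in records:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        single[category].append(single_score)
        multi[category].append(multi_score)
    return {c: mean(single[c]) - mean(multi[c]) for c in single}


# Hypothetical scores, NOT taken from the paper.
example = [
    ("recollection", 8.0, 6.0),
    ("recollection", 9.0, 7.0),
    ("refinement", 7.0, 7.0),
]
gaps = degradation_by_category(example)
```

Reporting the gap per category rather than as one overall number keeps the four interaction types separate, which matches how the benchmark organizes its queries.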