MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models

Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, Kam-Fai Wong

2024-01-30

Abstract: Large language models (LLMs) are increasingly relied upon for complex multi-turn conversations across diverse real-world applications. However, existing benchmarks predominantly focus on single-turn evaluations, overlooking the models' capabilities in multi-turn interactions. To address this gap, we introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities. By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up. We construct multi-turn queries for each category either by augmenting existing datasets or by creating new examples with GPT-4 to avoid data leakage. To study the factors impacting multi-turn abilities, we create single-turn versions of the 1170 multi-turn queries and compare performance. Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks. We observe significant performance degradation in multi-turn settings compared to single-turn settings in most models, which is not correlated with the models' fundamental capabilities. Moreover, we identify the distance to relevant content and susceptibility to error propagation as the key factors influencing multi-turn performance. MT-Eval is released publicly to encourage future research towards more robust conversational models.

Computation and Language

What problem does this paper attempt to address?

The paper addresses a gap in evaluation: existing benchmarks mainly focus on single-turn settings and neglect how large language models (LLMs) perform in multi-turn dialogues. Although frameworks such as MMLU and MT-Bench cover language understanding and some aspects of multi-turn dialogue, they do not comprehensively cover the types and complexities of real-world multi-turn conversations.

To address this, the paper proposes MT-Eval, a benchmark for evaluating multi-turn dialogue capabilities. It analyzes human-LLM interactions and categorizes interaction patterns into four types: Recollection, Expansion, Refinement, and Follow-up. Each type has its own evaluation set, designed to cover real-world application scenarios.

The main contributions of MT-Eval are:

1. Proposing a comprehensive multi-turn dialogue evaluation benchmark that covers a wide range of real-world scenarios.
2. Conducting an in-depth performance analysis of 11 popular LLMs, providing insights into their capabilities in multi-turn dialogues.
3. Identifying key factors that affect multi-turn performance, namely the distance to relevant content and susceptibility to error propagation.
4. Emphasizing the importance of evaluating LLMs in multi-turn settings, where performance can differ from single-turn evaluations.

Through this work, the paper hopes to drive future research towards more robust dialogue models.
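The comparison between multi-turn queries and their single-turn versions can be illustrated with a small sketch. It is not the paper's evaluation code: the function name `turn_gap`, the pairing of scores by query, and the scores themselves are assumptions made purely for illustration, and the numbers are placeholders rather than results from MT-Eval.

```python
from statistics import mean


def turn_gap(multi_turn_scores, single_turn_scores):
    """Mean per-query drop from the single-turn to the multi-turn setting.

    Both lists are assumed to be aligned: element i of each list is the
    score for the same underlying query in the two settings.
    A positive result means the model scored lower in the multi-turn setting.
    """
    if len(multi_turn_scores) != len(single_turn_scores):
        raise ValueError("each multi-turn query needs a single-turn counterpart")
    if not multi_turn_scores:
        raise ValueError("at least one query pair is required")
    return mean(s - m for m, s in zip(multi_turn_scores, single_turn_scores))


# Placeholder scores for three hypothetical queries (NOT results from the paper).
multi = [6.0, 7.0, 8.0]
single = [8.0, 8.0, 9.0]
gap = turn_gap(multi, single)
print(f"average degradation in multi-turn setting: {gap:.2f}")
```

Pairing scores by query, rather than comparing two unrelated averages, keeps the gap attributable to the conversational setting and not to differences in query content.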

MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

SimulBench: Evaluating Language Models with Creative Simulation Tasks

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

MATEval: A Multi-Agent Discussion Framework for Advancing Open-Ended Text Evaluation

Self-Directed Turing Test for Large Language Models

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

Holistic Evaluation of Language Models

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

A Comprehensive Analysis of the Effectiveness of Large Language Models As Automatic Dialogue Evaluators

Measuring Taiwanese Mandarin Language Understanding

Large Language Model Evaluation Via Multi AI Agents: Preliminary results

Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types

CMMLU: Measuring massive multitask language understanding in Chinese