Benchmarking Complex Instruction-Following with Multiple Constraints Composition

Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxin Xu, Yiming Liu, Jie Tang, Hongning Wang, Minlie Huang
2024-10-31
Abstract: Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses how to evaluate the ability of large language models (LLMs) to follow complex instructions. Existing benchmarks mainly model the different types of constraints that appear in human instructions while neglecting how those constraints are composed, even though composition is an indispensable part of complex instructions. The research question is therefore how to comprehensively evaluate LLMs on instructions that combine multiple constraints.

### Background and Motivation

As LLMs grow more capable, they are increasingly applied to complex real-world instructions. Existing benchmarks do not adequately measure this ability, in particular because they ignore the compositional structure among constraints. Such composition is a natural phenomenon in language use and a long-standing research topic in the NLP community, which makes a benchmark that accounts for it especially important.

### Solution

To address these issues, the authors propose a new benchmark called ComplexBench. Its main features are:

1. **Hierarchical Taxonomy**: Defines 4 constraint types, 19 constraint dimensions, and 4 composition types, giving a comprehensive view of what a complex instruction can contain.
2. **High-Quality Dataset**: A manually constructed dataset that covers all constraint and composition types.
3. **New Automated Evaluation Method**: Augments LLM-based evaluators with rules to verify whether generated text satisfies each constraint and composition, then computes the final score according to the dependency structure determined by the composition types.
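
The dependency-based scoring in item 3 can be illustrated with a small sketch. This is not the authors' implementation: it assumes each constraint has already been reduced to a yes/no verdict (by a rule or an LLM judge), that dependencies form an acyclic graph, and that a check only counts as satisfied when every check it depends on is also satisfied. The function names, check names, and the averaging rule are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def aggregate_with_dependencies(
    verdicts: Dict[str, bool],
    depends_on: Dict[str, List[str]],
) -> Dict[str, bool]:
    """Return final verdicts: a check passes only if it passed on its own
    and all of its prerequisites pass (transitively)."""
    final: Dict[str, bool] = {}

    def resolve(check: str, visiting: frozenset = frozenset()) -> bool:
        if check in final:
            return final[check]
        if check in visiting:
            raise ValueError(f"cyclic dependency at {check!r}")
        ok = verdicts[check] and all(
            resolve(dep, visiting | {check})
            for dep in depends_on.get(check, [])
        )
        final[check] = ok
        return ok

    for check in verdicts:
        resolve(check)
    return final


def instruction_score(
    verdicts: Dict[str, bool],
    depends_on: Dict[str, List[str]],
) -> float:
    """Fraction of checks that survive dependency propagation
    (an assumed aggregation rule, chosen for simplicity)."""
    final = aggregate_with_dependencies(verdicts, depends_on)
    return sum(final.values()) / len(final) if final else 0.0


# Hypothetical raw verdicts for one model response.
verdicts = {
    "q1_topic": True,
    "q2_format": False,
    "q3_format_detail": True,  # passed on its own, but depends on q2_format
    "q4_length": True,
}
depends_on = {"q3_format_detail": ["q2_format"]}

print(aggregate_with_dependencies(verdicts, depends_on))
print(instruction_score(verdicts, depends_on))
```

In this example `q3_format_detail` is downgraded to a failure because its prerequisite `q2_format` failed, so two of the four checks count and the score is 0.5. Without the propagation step, the same response would score 0.75, which is the kind of overestimate a dependency-aware score is meant to avoid.
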

### Experiments and Results

The authors evaluate multiple existing LLMs on the proposed benchmark and systematically examine their performance across constraint and composition types. The results show that existing LLMs have significant deficiencies when following complex instructions composed of multiple constraints.

### Main Contributions

1. **Proposed a Hierarchical Taxonomy**: 4 constraint types, 19 constraint dimensions, and 4 composition types.
2. **Constructed a High-Quality Benchmark Dataset**: Covers all constraint and composition types.
3. **Designed a New Automated Evaluation Method**: Combines LLM-based and rule-based verification and aggregates the final score according to the dependency structure.

Through these contributions, ComplexBench systematically reveals the shortcomings of existing LLMs on complex instructions and offers insights for improving their handling of different constraint and composition types.