Abstract: Instruction following is one of the fundamental capabilities of large language models (LLMs). As the ability of LLMs is constantly improving, they have been increasingly applied to deal with complex human instructions in real-world scenarios. Therefore, how to evaluate the ability of complex instruction-following of LLMs has become a critical research problem. Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints, which is an indispensable constituent in complex instructions. To this end, we propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints. We propose a hierarchical taxonomy for complex instructions, including 4 constraint types, 19 constraint dimensions, and 4 composition types, and manually collect a high-quality dataset accordingly. To make the evaluation reliable, we augment LLM-based evaluators with rules to effectively verify whether generated texts can satisfy each constraint and composition. Furthermore, we obtain the final evaluation score based on the dependency structure determined by different composition types. ComplexBench identifies significant deficiencies in existing LLMs when dealing with complex instructions with multiple constraints composition.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses the problem of evaluating how well large language models (LLMs) follow complex instructions. Existing benchmarks mainly model the different types of constraints found in human instructions while neglecting how those constraints are composed, which is an indispensable part of complex instructions. How to comprehensively evaluate the ability of LLMs to follow instructions that combine multiple constraints has therefore become an important research question.

### Background and Motivation

As LLMs' capabilities continue to improve, they are increasingly applied to complex instructions in real-world scenarios. Existing benchmarks, however, are insufficient for evaluating this ability, in particular because they ignore the composition of constraints. Such composition is a natural phenomenon in language use and a long-standing research topic in the NLP community, so a benchmark that evaluates it comprehensively is needed.

### Solution

To address these issues, the authors propose a new benchmark called ComplexBench. Its main features are:

1. **Hierarchical Taxonomy**: Defines 4 constraint types, 19 constraint dimensions, and 4 composition types, providing a comprehensive perspective for evaluating LLMs' ability to follow complex instructions.
2. **High-Quality Dataset**: A manually constructed dataset covering all constraint types and composition types.
3. **New Automated Evaluation Method**: Combines LLM-based and rule-based evaluation to verify whether generated text satisfies each constraint and composition, then aggregates the final score according to the dependency structure determined by the composition types.

### Experiments and Results

The authors ran experiments on multiple existing LLMs using the proposed benchmark, systematically revealing their shortcomings across constraint types and composition types. The results show that existing LLMs have significant deficiencies when following complex instructions that compose multiple constraints.

### Main Contributions

1. **Proposed a Hierarchical Taxonomy**: 4 constraint types, 19 constraint dimensions, and 4 composition types.
2. **Constructed a High-Quality Benchmark Dataset**: Covers all constraint types and composition types.
3. **Designed a New Automated Evaluation Method**: Combines LLM-based and rule-based evaluation and aggregates the final score through the dependency structure.

Through these contributions, ComplexBench systematically reveals the shortcomings of existing LLMs in following complex instructions and provides insights for improving their handling of different constraint and composition types.
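
The idea of aggregating a score through a dependency structure can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each constraint is reduced to a yes/no check (from a rule or an LLM judge), that a check only counts as passed if every check it depends on also passed, and that the final score is the fraction of checks that pass. The function name `aggregate` and the check names are hypothetical.

```python
from typing import Dict, List


def aggregate(results: Dict[str, bool], depends_on: Dict[str, List[str]]) -> float:
    """Fraction of checks that pass once dependencies are enforced.

    results:    raw yes/no outcome per check (rule-based or LLM-judged).
    depends_on: check name -> checks it depends on (assumed acyclic).
    """
    if not results:
        return 0.0

    resolved: Dict[str, bool] = {}

    def resolve(check: str) -> bool:
        # A check passes only if its own result is True and
        # every check it depends on also resolves to True.
        if check not in resolved:
            parents = depends_on.get(check, [])
            resolved[check] = results[check] and all(resolve(p) for p in parents)
        return resolved[check]

    return sum(resolve(c) for c in results) / len(results)


# Hypothetical example: the tone check only matters if the right branch was chosen.
raw = {"selects_branch": False, "uses_formal_tone": True, "under_100_words": True}
deps = {"uses_formal_tone": ["selects_branch"]}
score = aggregate(raw, deps)
```

In this example `uses_formal_tone` is counted as failed because the check it depends on failed, so only one of the three checks contributes to the score. Independent checks such as `under_100_words` are unaffected.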

Benchmarking Complex Instruction-Following with Multiple Constraints Composition
