Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao

2024-01-01

Abstract: While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To address this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate on the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Different from existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and showing that there is still a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at

Computation and Language

What problem does this paper attempt to address?

This paper asks how well large language models (LLMs) respond to instructions that carry specific constraints. Although LLMs have demonstrated impressive instruction-following capabilities, it remains unclear whether and to what extent they can respond to the explicit constraints that instructions may include, for example writing an article summary of a specific length or drafting an email with the expected sentiment. To fill this research gap, the paper proposes a new benchmark, CoDI-Eval, which aims to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. The authors construct a test suite containing a large collection of constraint-attributed instructions, focusing on generalization and coverage. They also advocate an instruction diversification process to synthesize diverse forms of constraint expression, and they refine the candidate task taxonomy into finer-grained sub-categories. Finally, the paper automates the entire evaluation process to facilitate further development.
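To make the idea of automatically checking a constraint concrete, here is a minimal sketch in Python of how a length constraint (such as "a summary of at most N words") might be verified and aggregated into an accuracy score. This is an illustration only, not the evaluation code of CoDI-Eval: the function names (`word_count`, `satisfies_length`, `constraint_accuracy`), the whitespace-based word counting, and the sample responses are all assumptions made for this example.

```python
from typing import Callable, List, Optional


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplistic notion of 'word')."""
    return len(text.split())


def satisfies_length(
    response: str,
    min_words: Optional[int] = None,
    max_words: Optional[int] = None,
) -> bool:
    """Return True if the response length falls within the given bounds (inclusive)."""
    n = word_count(response)
    if min_words is not None and n < min_words:
        return False
    if max_words is not None and n > max_words:
        return False
    return True


def constraint_accuracy(responses: List[str], checker: Callable[[str], bool]) -> float:
    """Fraction of responses that pass the checker; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(checker(r) for r in responses) / len(responses)


# Hypothetical model outputs for the instruction "Summarize in at most five words."
responses = [
    "The model follows instructions well.",
    "This summary is far longer than the requested limit of five words.",
]

accuracy = constraint_accuracy(responses, lambda r: satisfies_length(r, max_words=5))
print(accuracy)  # 0.5: only the first response stays within five words
```

A rule-based check like this only suits constraints that can be measured directly from the text. Constraints such as sentiment would need a different kind of checker, for example a classifier, which this sketch does not attempt to model.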
