Abstract: Recent studies show that LLMs, particularly open-source models, struggle to follow complex instructions with multiple constraints. Despite its importance, methods to improve LLMs' adherence to such constraints remain largely unexplored, and current research focuses on evaluating this ability rather than developing solutions. While a few studies enhance constraint adherence through model tuning, this approach is computationally expensive and heavily reliant on training data quality. An alternative is to leverage LLMs' self-correction capabilities, allowing them to adjust responses to better meet specified constraints. However, this self-correction ability is limited by feedback quality, as LLMs cannot autonomously generate reliable feedback or detect errors. Moreover, the self-refinement process depends heavily on few-shot examples that illustrate how to modify responses to meet constraints. As constraints in complex instructions are diverse and vary widely, manually crafting few-shot examples for each constraint type can be labor-intensive and sub-optimal. To deal with these two challenges, we propose the Divide-Verify-Refine (DVR) framework with three steps: (1) Divide: split complex instructions into single constraints and prepare appropriate tools; (2) Verify: to address the feedback quality problem, these tools rigorously verify responses and provide reliable feedback; (3) Refine: to address the constraint diversity challenge, we design a refinement repository that collects successful refinement processes and uses them as few-shot demonstrations for future cases, allowing LLMs to learn from past experience during inference. Additionally, we develop a new dataset of complex instructions, each containing 1-6 constraints. Experiments show that the framework significantly improves performance, doubling Llama3.1-8B's constraint adherence on instructions with 6 constraints.

What problem does this paper attempt to address?

This paper addresses the challenges that large language models (LLMs) face when following complex instructions, especially instructions that contain multiple constraints. It focuses on three issues:

1. **Feedback Reliability Problem**:
   - The feedback LLMs generate during self-correction is of low quality, so improvements are unstable and performance sometimes even degrades.
   - LLMs cannot reliably generate feedback or detect errors on their own, especially for multi-constraint instructions.
2. **Constraint Diversity Problem**:
   - Different types of constraints (such as text length, number of bullet points, or inclusion of specific keywords) require different modification methods.
   - Manually creating few-shot examples for each constraint type is both time-consuming and inefficient.
3. **Limitations of Existing Datasets**:
   - Existing datasets lack complexity and internal consistency, leading to incomplete evaluations.
   - Most benchmark datasets contain only 1-2 constraints per instruction, while real application scenarios may involve more.

To solve these problems, the paper proposes a framework named Divide-Verify-Refine (DVR), with the following steps:

1. **Divide**:
   - Decompose a complex instruction into individual constraints and prepare a corresponding tool for each.
   - For example, for an instruction requiring 4 key points, DVR isolates the task of "checking the number of key points" and prepares a matching tool (such as a Python script).
2. **Verify**:
   - Use external tools to strictly verify whether the response meets each constraint and to provide reliable feedback.
   - If the response violates a constraint, the tool points out the specific error and suggests the direction of modification. For example, if 4 key points are required but only 2 are present, the tool reports "2 more key points need to be added".
3. **Refine**:
   - Use the feedback together with past successful refinement examples (from the refinement repository) to adjust the response so that it meets all constraints.
   - Successful refinement processes are saved in the refinement repository for future use.

In addition, to make the evaluation more comprehensive, the authors construct a new complex-instruction dataset named ComplexInstruct, in which each instruction contains 1-6 constraints. Experimental results show that the DVR framework significantly improves the ability of LLMs to follow complex instructions, especially those with multiple constraints. The DVR framework thus addresses both the feedback reliability and constraint diversity problems, and it offers a scalable solution that requires no retraining.
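The Verify and Refine steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `make_bullet_count_verifier`, `RefinementRepository`, `verify_and_refine`, and `toy_refine` are hypothetical, key points are assumed to be written as `-` or `*` bullets, and `toy_refine` is a stand-in for the LLM call that would actually rewrite the response.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A verifier returns None when the constraint is satisfied,
# otherwise a feedback string describing what to change.
Verifier = Callable[[str], Optional[str]]


def make_bullet_count_verifier(required: int) -> Verifier:
    """Hypothetical tool: check that a response has exactly `required` bullets."""
    def verify(response: str) -> Optional[str]:
        # Assumption: a key point is a line starting with "-" or "*".
        found = len(re.findall(r"^[ \t]*[-*]\s+", response, flags=re.MULTILINE))
        if found == required:
            return None
        if found < required:
            return f"{required - found} more key points need to be added"
        return f"{found - required} key points need to be removed"
    return verify


@dataclass
class RefinementRepository:
    """Illustrative store of successful refinements, keyed by constraint type."""
    examples: dict = field(default_factory=dict)

    def add(self, constraint_type: str, before: str, feedback: str, after: str) -> None:
        self.examples.setdefault(constraint_type, []).append((before, feedback, after))

    def retrieve(self, constraint_type: str, k: int = 2) -> List[Tuple[str, str, str]]:
        # Return the k most recent successes as few-shot demonstrations.
        return self.examples.get(constraint_type, [])[-k:]


def verify_and_refine(response, constraint_type, verifier, refine_fn, repo, max_rounds=3):
    """Loop: verify with the tool, refine with feedback and demos, save successes."""
    for _ in range(max_rounds):
        feedback = verifier(response)
        if feedback is None:
            return response
        demos = repo.retrieve(constraint_type)
        candidate = refine_fn(response, feedback, demos)
        if verifier(candidate) is None:
            # Only refinements that pass verification are stored.
            repo.add(constraint_type, response, feedback, candidate)
        response = candidate
    return response  # may still violate the constraint after max_rounds


def toy_refine(response: str, feedback: str, demos) -> str:
    """Stand-in for an LLM call; only handles the 'add N key points' feedback."""
    missing = int(feedback.split()[0])
    return response + "".join(f"\n- extra point {i + 1}" for i in range(missing))


verifier = make_bullet_count_verifier(4)
repo = RefinementRepository()
final = verify_and_refine("- a\n- b", "bullet_count", verifier, toy_refine, repo)
```

In this sketch the feedback comes from a deterministic check rather than from the model itself, which is the point of the Verify step, and the repository grows only with refinements that the tool has confirmed.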

Divide-Verify-Refine: Aligning LLM Responses with Complex Instructions
