IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li

2024-11-09

Abstract: In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, only a limited amount of evaluation data for complex instructions exists; on the other hand, there are no dedicated algorithms for improving the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses the limited ability of large language models (LLMs) to follow complex instructions. Specifically, it points out that although some benchmarks exist for evaluating the instruction-following ability of LLMs, evaluation data for complex instructions is limited and no algorithm is dedicated to improving complex instruction following. The paper therefore makes two main contributions:

1. **TRACE Benchmark**: A new benchmark designed to both evaluate and improve the ability of LLMs to follow complex instructions. TRACE contains 120,000 training instances and 1,000 evaluation instances, covering multiple constraint types and varying numbers of constraints.
2. **IOPO Method**: Input-Output Preference Optimization. Unlike existing methods such as DPO, IOPO considers not only output (response) preferences but also input (instruction) preferences, exploring the fine-grained constraints in the instructions so that the LLM better understands and executes complex instructions.

With these contributions, the paper aims to fill the gaps in existing research on complex instruction following. Experiments show that IOPO improves over SFT and DPO on both in-domain and out-of-domain datasets.
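For context on the output-only baseline that IOPO is contrasted with, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the IOPO objective, which this summary does not spell out. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Illustrative sketch only: names and the default beta are assumptions,
    and this is the output-preference baseline, not IOPO itself.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Loss is -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the policy favors the chosen response more than
# the reference model does.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
print(neutral, improved)
```

Note that both responses in this loss are conditioned on the same instruction; according to the summary above, IOPO additionally forms preference pairs over the input instructions.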

AIPO: Improving Training Objective for Iterative Preference Optimization

ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization

Diverse and Fine-Grained Instruction-Following Ability Exploration with Synthetic Data

Beyond IID: Optimizing Instruction Learning from the Perspective of Instruction Interaction and Dependency

Aligning CodeLLMs with Direct Preference Optimization

Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization

InFoBench: Evaluating Instruction Following Ability in Large Language Models

TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Offline Prompt Polishing for Low Quality Instructions

FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs

From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models

Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following

InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models

Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models

Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models

Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process

WPO: Enhancing RLHF with Weighted Preference Optimization

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts