IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li
2024-11-09
Abstract: In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a limited amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the limited ability of large language models (LLMs) to follow complex instructions. Specifically, it points out that evaluation data for complex instruction following is scarce, and that no algorithms are specifically aimed at improving the complex-instruction-following ability. The paper therefore makes two main contributions:

1. **TRACE Benchmark**: A new benchmark designed to both improve and evaluate the ability of LLMs to follow complex instructions. TRACE contains 120,000 training examples and 1,000 evaluation examples, covering multiple constraint types and varying numbers of constraints.
2. **IOPO Method**: Input-Output Preference Optimization. Unlike existing methods such as DPO, IOPO considers not only output (response) preferences but also input (instruction) preferences, exploring the fine-grained constraints in the instructions and thereby improving how well LLMs understand and execute complex instructions.

Through these contributions, the paper aims to fill gaps in existing research on complex instruction following. Experiments on in-domain and out-of-domain datasets show that IOPO improves over SFT and DPO by 8.15% and 2.18% on in-domain data, and by 6.29% and 3.13% on out-of-domain data, respectively.
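
The difference between optimizing output preferences only and optimizing preferences over both inputs and outputs can be illustrated with a minimal sketch. Here `dpo_loss` follows the standard DPO objective for a single preference pair. `input_output_loss` is a hypothetical illustration written for this summary, not the objective defined in the paper: it assumes two instructions `x1`, `x2` with matching responses `y1`, `y2`, and simply averages four DPO-style terms, two comparing responses under a fixed instruction and two comparing instructions for a fixed response. All function names, the dictionary layout, and `beta=0.1` are assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def input_output_loss(logp: dict, ref_logp: dict, beta: float = 0.1) -> float:
    """Hypothetical illustration only (NOT the paper's IOPO objective).

    logp[(i, j)] is the policy log-probability of response y_j given
    instruction x_i, for i, j in {1, 2}; ref_logp holds the same
    quantities under the reference model. y_1 matches x_1, y_2 matches x_2.
    """
    # Implicit reward: policy log-prob minus reference log-prob.
    r = {key: logp[key] - ref_logp[key] for key in logp}
    margins = [
        r[(1, 1)] - r[(1, 2)],  # output preference: fixed x1, y1 over y2
        r[(2, 2)] - r[(2, 1)],  # output preference: fixed x2, y2 over y1
        r[(1, 1)] - r[(2, 1)],  # input preference: fixed y1, x1 over x2
        r[(2, 2)] - r[(1, 2)],  # input preference: fixed y2, x2 over x1
    ]
    return -sum(log_sigmoid(beta * m) for m in margins) / len(margins)
```

When the policy and reference model assign identical log-probabilities, every margin is zero and both losses equal log 2. Raising the log-probability of matched instruction-response pairs relative to mismatched ones lowers the loss.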