InFoBench: Evaluating Instruction Following Ability in Large Language Models

Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu
2024-01-08
Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses shortcomings in how the instruction-following ability of large language models (LLMs) is currently evaluated. Existing methods such as A/B testing, overall scoring, and Elo rating are effective to a degree, but they scale poorly and are hard to interpret, so evaluations of complex instruction following are neither comprehensive nor reliable enough. To address this, the paper introduces a new evaluation metric, the Decomposed Requirements Following Ratio (DRFR). DRFR decomposes a complex instruction into simpler criteria, which allows a finer-grained analysis of how well an LLM complies with each aspect of a task. The paper also presents InFoBench, a benchmark of 500 diverse instructions and 2,250 decomposed questions spanning multiple constraint categories, designed to test and analyze instruction-following ability systematically. Together, these contributions aim to provide new tools for the future development and evaluation of LLMs, particularly for improving complex instruction following.
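As an illustration of the decomposition idea, the sketch below computes a DRFR-style score. It assumes the ratio is the number of satisfied decomposed requirements divided by the total number of decomposed requirements across all instructions, with each requirement judged as a yes/no question. The function name `drfr` and the nested-list annotation layout are hypothetical and are not taken from the paper's code.

```python
from typing import Sequence


def drfr(annotations: Sequence[Sequence[bool]]) -> float:
    """Return the fraction of decomposed requirements that were satisfied.

    annotations[i][j] is True when the response to instruction i was judged
    to satisfy decomposed requirement j (hypothetical layout for this sketch).
    """
    total = sum(len(reqs) for reqs in annotations)
    if total == 0:
        raise ValueError("no decomposed requirements to score")
    satisfied = sum(sum(1 for ok in reqs if ok) for reqs in annotations)
    return satisfied / total


# Three instructions with 3, 2, and 4 decomposed requirements respectively.
example = [
    [True, True, False],
    [True, False],
    [True, True, True, True],
]
score = drfr(example)  # 7 of the 9 requirements are satisfied
print(f"DRFR = {score:.3f}")
```

Because the score pools requirements across instructions, an instruction with more decomposed questions contributes more to the total than a simpler one. Averaging a per-instruction ratio instead would be a different design choice.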