InFoBench: Evaluating Instruction Following Ability in Large Language Models

Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu

2024-01-08

Abstract: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models' (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs' compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR's higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses shortcomings in current methods for evaluating how well large language models (LLMs) follow instructions. Existing approaches such as A/B testing, overall scoring, and Elo rating are effective to a degree, but they scale poorly and are hard to interpret, which makes the evaluation of complex instruction following neither comprehensive nor reliable enough. To address this, the paper introduces a new evaluation metric, the Decomposed Requirements Following Ratio (DRFR). DRFR decomposes complex instructions into simpler criteria, allowing a more detailed analysis of how LLMs perform on each aspect of a task. The paper also proposes a benchmark dataset, InFoBench, which contains 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories, for systematically testing and analyzing the instruction-following ability of LLMs. Through these contributions, the paper aims to provide new tools and methods for the future development and evaluation of LLMs, particularly for improving complex instruction following.
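As a rough illustration of the ratio described above, the sketch below assumes that each instruction has been decomposed into yes/no requirements and that DRFR is the fraction of those requirements judged satisfied, pooled over all instructions. The function name `drfr`, the input layout, and the pooling choice are assumptions made for this example, not details taken from the summary.

```python
from typing import Sequence


def drfr(judgments: Sequence[Sequence[bool]]) -> float:
    """Fraction of decomposed requirements judged satisfied.

    `judgments[i][j]` is True when the response to instruction i meets
    its j-th decomposed requirement (hypothetical input layout).
    """
    # Total number of decomposed requirements across all instructions.
    total = sum(len(per_instruction) for per_instruction in judgments)
    if total == 0:
        raise ValueError("no decomposed requirements to score")
    # Number of requirements that an annotator marked as satisfied.
    satisfied = sum(
        1 for per_instruction in judgments for ok in per_instruction if ok
    )
    return satisfied / total


# Two instructions: the first has three requirements (one missed),
# the second has two requirements (both met).
example = [
    [True, True, False],
    [True, True],
]
print(f"DRFR = {drfr(example):.2f}")  # prints "DRFR = 0.80"
```

Pooling over requirements gives more weight to instructions with more decomposed questions; averaging a per-instruction ratio instead would be a different design choice and can yield a different score.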

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

Evaluating Large Language Models at Evaluating Instruction Following

FollowEval: A Multi-Dimensional Benchmark for Assessing the Instruction-Following Capability of Large Language Models

Beyond Instruction Following: Evaluating Inferential Rule Following of Large Language Models

The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models

Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following

Benchmarking Complex Instruction-Following with Multiple Constraints Composition

Diverse and Fine-Grained Instruction-Following Ability Exploration with Synthetic Data

Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

ReIFE: Re-evaluating Instruction-Following Evaluation

CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

Enhancing and Assessing Instruction-Following with Fine-Grained Instruction Variants

BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models

Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Conifer: Improving Complex Constrained Instruction-Following Ability of Large Language Models

TOWER: Tree Organized Weighting for Evaluating Complex Instructions