Diverse and Fine-Grained Instruction-Following Ability Exploration with Synthetic Data

Zihui Gu, Xingwu Sun, Fengzong Lian, Zhanhui Kang, Cheng-Zhong Xu, Ju Fan
2024-07-04
Abstract: Instruction-following is particularly crucial for large language models (LLMs) to support diverse user requests. While existing work has made progress in aligning LLMs with human preferences, evaluating their capabilities on instruction following remains a challenge due to the complexity and diversity of real-world user instructions. Existing evaluation methods focus on general skills and suffer from two main shortcomings, i.e., a lack of fine-grained task-level evaluation and reliance on singular instruction expression. To address these problems, this paper introduces DINGO, a fine-grained and diverse instruction-following evaluation dataset that has two main advantages: (1) DINGO is based on a manually annotated, fine-grained and multi-level category tree with 130 nodes derived from real-world user requests; (2) DINGO includes diverse instructions, generated by both GPT-4 and human experts. Through extensive experiments, we demonstrate that DINGO can not only provide more challenging and comprehensive evaluation for LLMs, but also provide task-level fine-grained directions to further improve LLMs.
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the inadequate evaluation of the instruction-following ability of large language models (LLMs). Although existing research has aligned LLMs with human instructions through supervised instruction tuning or reinforcement learning from human feedback, comprehensively evaluating that ability remains difficult. Specifically, existing evaluation methods have two main problems:

1. **Lack of fine-grained task-level evaluation**: This makes it difficult to improve the instruction-following ability of LLMs. For example, an existing evaluation skill such as factuality covers multiple subtasks, such as history knowledge question answering (History Knowledge QA) and chemical knowledge question answering (Chemical Knowledge QA). If an LLM performs poorly on chemical knowledge question answering, the cause may be non-standard chemical formulas in its responses; if it performs poorly on history knowledge question answering, the cause may be that key points are not clearly listed. A single skill-level score hides this difference.
2. **Singular instruction expression**: This creates a gap between real-world user instructions and existing evaluation datasets. Existing datasets usually reuse earlier NLP datasets as evaluation data for specific skills and design fixed instruction templates for them. In practice, however, users express their requests in very diverse ways.

To solve these problems, the paper introduces DINGO, a diverse and fine-grained instruction-following evaluation dataset. DINGO has two main advantages:

- **Multi-level category tree based on manual annotation**: The tree contains 130 nodes derived from real-world user requests and supports task-level analysis at different granularities.
- **Diverse instruction data**: The instructions are generated by both GPT-4 and human experts, which provides diversity and high quality.

Through extensive experiments, the paper shows that DINGO not only provides a more challenging and comprehensive evaluation, but also gives fine-grained, task-level directions for improving the instruction-following ability of LLMs.
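
The task-level analysis at different granularities that the category tree supports can be illustrated with a small roll-up of per-instruction scores. The following is a minimal sketch, not the paper's implementation: the `rollup` function, the tuple-based path representation, the "Writing / Email Drafting" category, and all scores are assumptions made up for illustration, and only the "Factuality" subtasks are named in the summary above.

```python
from collections import defaultdict

# Hypothetical per-instruction results as (category path, score) pairs.
# Paths and scores are illustrative placeholders, not DINGO data.
results = [
    (("Factuality", "History Knowledge QA"), 8.0),
    (("Factuality", "History Knowledge QA"), 6.0),
    (("Factuality", "Chemical Knowledge QA"), 4.0),
    (("Factuality", "Chemical Knowledge QA"), 2.0),
    (("Writing", "Email Drafting"), 9.0),
]


def rollup(results):
    """Average the scores at every node (path prefix) of the category tree."""
    buckets = defaultdict(list)
    for path, score in results:
        # Each score counts toward its leaf and toward every ancestor node.
        for depth in range(1, len(path) + 1):
            buckets[path[:depth]].append(score)
    return {node: sum(scores) / len(scores) for node, scores in buckets.items()}


scores = rollup(results)
for node in sorted(scores):
    print(" / ".join(node), round(scores[node], 2))
```

In this toy data the coarse "Factuality" node averages 5.0, while its two subtasks average 7.0 and 3.0. That gap is the kind of signal a skill-level score alone would not expose.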