Abstract: As Large Language Models (LLMs) grow increasingly adept at managing complex tasks, evaluation sets must keep pace with these advancements to remain sufficiently discriminative. Item Discrimination (ID) theory, which is widely used in educational assessment, measures the ability of individual test items to differentiate between high and low performers. Inspired by this theory, we propose an ID-induced prompt synthesis framework for evaluating LLMs, so that the evaluation set can be continually updated and refined according to model abilities. Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs while revealing meaningful performance differences between models, allowing for effective discrimination of their relative strengths and weaknesses across various tasks and domains. To produce high-quality data, we incorporate a self-correction mechanism into our generalization framework and develop two models that predict prompt discrimination and difficulty scores to facilitate our data synthesis framework, contributing valuable tools to evaluation data synthesis research. We apply our generated data to evaluate five SOTA models. On our data, these models achieve an average score of 51.92 with a variance of 10.06. By contrast, data from previous works (i.e., SELF-INSTRUCT and WizardLM) yields an average score exceeding 67 with a variance below 3.2. These results demonstrate that the data generated by our framework is more challenging and more discriminative than that of previous works. We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research on LLMs.
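
To make the notion of item discrimination concrete, the following minimal Python sketch computes the classical upper-lower discrimination index from educational assessment, together with the mean and variance of a set of per-model scores. It is an illustration only and is not taken from the framework described above: the function names, the `fraction` parameter, the toy data, and the use of population variance are all assumptions, since the abstract does not state how its scores or variances are computed.

```python
from statistics import mean, pvariance


def discrimination_index(item_scores, total_scores, fraction=0.27):
    """Classical upper-lower discrimination index for a single item.

    item_scores[i]  : score of test-taker i on this item, scaled to [0, 1]
    total_scores[i] : overall score of test-taker i on the whole test
    fraction        : share of test-takers placed in each of the upper and
                      lower groups (0.27 is a common textbook choice;
                      assumed here, not specified by the abstract)
    """
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Rank test-takers by overall score, best first.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = [item_scores[i] for i in order[:k]]
    lower = [item_scores[i] for i in order[-k:]]
    # Items that strong performers solve and weak performers miss score high.
    return mean(upper) - mean(lower)


def summarize(model_scores):
    """Return (mean, population variance) of per-model scores.

    Whether the abstract reports population or sample variance is not
    stated; population variance is an assumption made for this sketch.
    """
    return mean(model_scores), pvariance(model_scores)


# Hypothetical toy data: five test-takers ranked by overall score.
totals = [90, 80, 70, 60, 50]
item_a = [1, 1, 1, 0, 0]  # solved only by the stronger test-takers
item_b = [1, 0, 1, 0, 1]  # unrelated to overall ability

index_a = discrimination_index(item_a, totals, fraction=0.4)
index_b = discrimination_index(item_b, totals, fraction=0.4)
avg, var = summarize([50.0, 52.0, 54.0])
```

In this toy setting, `item_a` separates the two strongest test-takers from the two weakest and so receives a higher index than `item_b`. In the same spirit, a larger variance of scores across models at a lower average suggests an evaluation set that is both harder and better at telling models apart.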

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Direct-Inverse Prompting: Analyzing LLMs' Discriminative Capacity in Self-Improving Generation

On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs

LLM Evaluators Recognize and Favor Their Own Generations

Understanding the Dark Side of LLMs' Intrinsic Self-Correction

LLMs can learn self-restraint through iterative self-reflection

Large Language Models have Intrinsic Self-Correction Ability

I Learn Better If You Speak My Language: Understanding the Superior Performance of Fine-Tuning Large Language Models with LLM-Generated Responses

Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement

Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs

Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method

Self-Evaluation Improves Selective Generation in Large Language Models

IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation

See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models

Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives

Self-Preference Bias in LLM-as-a-Judge

Intuitive or Dependent? Investigating LLMs' Behavior Style to Conflicting Prompts

When is the consistent prediction likely to be a correct prediction?