Abstract: Pre-trained large language models (LLMs) can be tailored to adhere to human instructions through instruction tuning. However, due to shifts in the distribution of test-time data, they may not always execute instructions accurately, potentially generating factual errors or misaligned content when acting as chat assistants. To enhance the reliability of LLMs in following instructions, we propose the study of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low. We train judge models that can predict numerical quality scores for model responses. To address data scarcity, we introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores. Our method leverages the model's inherent self-evaluation capability to extract information about response quality from labeled instruction-tuning data. It incorporates a gold reference answer to facilitate self-evaluation and recalibrates by assessing the semantic similarity between the response sample and the gold reference. During the training phase, we implement self-distillation as a regularization technique to enhance the capability of reference-free estimation. To validate alignment evaluation on general instruction-following tasks, we collect large-scale high-quality instructions from Hugging Face for model training and evaluation. Extensive experiments on five open-source models show that our method correlates much more strongly with GPT-4 than strong baselines, e.g., supervised models distilled from GPT-4 and GPT-3.5-turbo. Our analysis shows our model's strong generalization across domains. Additionally, our judge models serve as good reward models, e.g., boosting WizardLM-13B-V1.2 from 89.17 to 92.48 and from 12.03 to 15.90 on versions v1 and v2 of AlpacaEval, respectively, using best-of-32 sampling with our judge models.
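The best-of-32 sampling mentioned at the end of the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `best_of_n`, `toy_sample`, and `toy_judge` are hypothetical names, and the toy judge (which simply scores by response length) stands in for a trained judge model that returns a numerical quality score.

```python
import itertools
from typing import Callable, Tuple


def best_of_n(
    instruction: str,
    sample: Callable[[str], str],
    judge: Callable[[str, str], float],
    n: int = 32,
) -> Tuple[str, float]:
    """Draw n candidate responses and keep the one the judge scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample(instruction) for _ in range(n)]
    scores = [judge(instruction, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


# Hypothetical stand-ins so the sketch runs without any model.
_drafts = itertools.cycle(["short", "a longer draft", "the longest draft of all"])


def toy_sample(instruction: str) -> str:
    # Assumption: a real sampler would call an LLM with temperature > 0.
    return next(_drafts)


def toy_judge(instruction: str, response: str) -> float:
    # Assumption: a real judge model would predict a quality score.
    return float(len(response))


best, score = best_of_n("Explain X", toy_sample, toy_judge, n=3)
print(best, score)
```

In this setup the judge acts as a reward model: generation is unchanged, and only the selection among sampled candidates depends on the predicted score.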

What problem does this paper attempt to address?

The paper addresses the errors that large language models (LLMs) can make when following instructions, especially when test-time data is distributed differently from the training data. In such cases a model may produce content that is misaligned with human preferences, such as factual errors or irrelevant responses. To tackle this, the authors propose selective instruction following, in which the system declines to execute an instruction when it predicts that the quality of its response will be low. This is achieved by training a judge model that predicts a numerical quality score for a response. To overcome the scarcity of annotated quality scores, the researchers introduce a self-training framework called Self-J, which trains the judge model without human-annotated quality scores. Specifically, Self-J uses the model's own self-evaluation capability to extract information about response quality from labeled instruction-tuning data, consulting the gold reference answer during self-evaluation and recalibrating the score with the semantic similarity between the response and that reference. During training, self-distillation serves as a regularizer that strengthens reference-free evaluation. Experiments on five open-source models show that Self-J correlates more strongly with GPT-4 than strong baselines. In summary, the study highlights the potential of alignment self-evaluation in large language models.
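The decision rule behind selective instruction following can be sketched as a simple threshold on the judge's score. This is an illustrative sketch only: `selective_respond`, the threshold value, the score scale, and the toy `generate` and `judge` functions are all assumptions made here, not details taken from the paper.

```python
from typing import Callable, Optional, Tuple


def selective_respond(
    instruction: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    threshold: float = 5.0,
) -> Tuple[Optional[str], float]:
    """Return the response only if its predicted quality reaches the threshold.

    When the judge's score falls below the threshold, the system declines
    (returns None) rather than emitting a likely low-quality answer.
    """
    response = generate(instruction)
    score = judge(instruction, response)
    if score < threshold:
        return None, score
    return response, score


# Hypothetical stand-ins so the sketch runs without any model.
def toy_generate(instruction: str) -> str:
    return "42"


def toy_judge(instruction: str, response: str) -> float:
    # Assumption: pretend the model is reliable only on math questions.
    return 8.0 if "math" in instruction else 2.0


answered = selective_respond("a math question", toy_generate, toy_judge)
declined = selective_respond("a legal question", toy_generate, toy_judge)
print(answered, declined)
```

Raising the threshold trades coverage for reliability: fewer instructions are answered, but those that are answered carry higher predicted quality.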

Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

Aligning Large Language Models by On-Policy Self-Judgment

Self-Taught Evaluators

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Generative Judge for Evaluating Alignment

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4

HREF: Human Response-Guided Evaluation of Instruction Following in Language Models

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Reasons to Reject? Aligning Language Models with Judgments

Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

Mitigating the Bias of Large Language Model Evaluation

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning