Abstract:Self-evaluation using large language models (LLMs) has proven valuable not only in benchmarking but also methods like reward modeling, constitutional AI, and self-refinement. But new biases are introduced due to the same LLM acting as both the evaluator and the evaluatee. One such bias is self-preference, where an LLM evaluator scores its own outputs higher than others' while human annotators consider them of equal quality. But do LLMs actually recognize their own outputs when they give those texts higher scores, or is it just a coincidence? In this paper, we investigate if self-recognition capability contributes to self-preference. We discover that, out of the box, LLMs such as GPT-4 and Llama 2 have non-trivial accuracy at distinguishing themselves from other LLMs and humans. By fine-tuning LLMs, we discover a linear correlation between self-recognition capability and the strength of self-preference bias; using controlled experiments, we show that the causal explanation resists straightforward confounders. We discuss how self-recognition can interfere with unbiased evaluations and AI safety more generally.

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve This paper attempts to address a bias in large language models (LLMs) during self-evaluation—self-preference. Specifically, when the same LLM acts as both the evaluator and the evaluated, it may give higher scores to its own outputs, while human annotators consider these outputs to be of comparable quality to those generated by other LLMs or humans. This self-preference phenomenon can lead to unfair evaluation results, affecting the accuracy of model benchmarking, reward modeling, and applications like constitutional AI. ### Main Research Questions 1. **Is self-preference truly caused by self-recognition ability?** - Researchers explore whether LLMs can recognize their own outputs and if this recognition ability leads to self-preference. Specifically, they want to know if LLMs give higher scores because they recognize a text as their own. 2. **What is the relationship between self-recognition ability and self-preference?** - Through experiments, researchers measure the self-recognition ability and self-preference of different LLMs and explore the correlation between the two. They find that LLMs with stronger self-recognition abilities exhibit more pronounced self-preference. ### Experimental Methods - **Measurement of Self-Recognition Ability**: - Two methods are used to measure the self-recognition ability of LLMs: pairwise comparison and single-text evaluation. In pairwise comparison, LLMs need to choose which of two texts was generated by themselves; in single-text evaluation, LLMs need to judge whether a given text was generated by themselves. - **Measurement of Self-Preference**: - Similarly, two methods are used to measure self-preference: pairwise comparison and single-text evaluation. In pairwise comparison, LLMs need to choose which of two texts is of higher quality; in single-text evaluation, LLMs need to rate the given text. - **Fine-Tuning Experiments**: - By fine-tuning LLMs to alter their self-recognition ability and observing whether self-preference changes accordingly, researchers find a linear relationship between self-recognition ability and self-preference after fine-tuning. ### Main Findings 1. **State-of-the-art LLMs exhibit self-preference**: - In two summarization tasks, GPT-3.5 Turbo, GPT-4, and Llama 2 tend to give higher scores to their own summaries while rating summaries generated by other LLMs or humans lower. 2. **LLMs have non-trivial self-recognition ability**: - The three un-fine-tuned LLMs (GPT-3.5 Turbo, GPT-4, and Llama 2) can distinguish their own outputs from texts generated by other sources with an accuracy exceeding 50%. GPT-4 achieves an accuracy of 73.5%. 3. **Near-perfect self-recognition ability after fine-tuning**: - After fine-tuning with 500 samples, the self-recognition accuracy of GPT-3.5 and Llama 2 exceeds 90%. 4. **Linear correlation between self-preference intensity and self-recognition ability**: - Further fine-tuning of LLMs reveals a linear correlation between self-recognition ability and self-preference. ### Safety and Future Work - **Safety Issues**: - Self-recognition ability may impact the safety of multi-LLM interactions, especially in applications like model benchmarking, reward modeling, and constitutional AI. Self-preference may lead to inflated model scores, undermining the effectiveness of supervision and management. - **Future Work**: - Validate causal hypothesis: Although experiments provide evidence supporting the hypothesis that self-recognition ability leads to self-preference, more experiments are needed to rule out potential confounding factors. - Control generation quality: To better understand the causes of self-preference, it is necessary to control the quality of generated texts, ensuring that self-preference is not due to the LLM-generated texts being genuinely better. In summary, this paper demonstrates the relationship between self-recognition ability and self-preference through experiments and proposes future research directions to mitigate the impact of this bias on AI systems.

LLM Evaluators Recognize and Favor Their Own Generations

Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement

Self-Preference Bias in LLM-as-a-Judge

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation

Self-Evaluation Improves Selective Generation in Large Language Models

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Large Language Models are Inconsistent and Biased Evaluators

Evaluating the Consistency of LLM Evaluators

Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions

Can LLM be a Personalized Judge?

Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments

See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses

Self-Evaluation as a Defense Against Adversarial Attacks on LLMs

LLM Voting: Human Choices and AI Collective Decision Making

Can Large Language Models Be an Alternative to Human Evaluations?

Unveiling Context-Aware Criteria in Self-Assessing LLMs

Self-Cognition in Large Language Models: An Exploratory Study

AI AI Bias: Large Language Models Favor Their Own Generated Content