Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models

Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, Zihao Zhou, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang

2024-07-19

Abstract: Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing the probability distribution to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction remains unchanged. Our experimental results demonstrate that this attack effectively undermines the model's self-evaluation reliability in multiple-choice questions. For instance, we achieved a 100% attack success rate (ASR) across three different triggering strategies in four models. Further, we investigate whether this manipulation generalizes across different prompts and domains. This work highlights a significant threat to the reliability of LLMs and underscores the need for future defenses against such attacks. The code is available at https://github.com/qcznlp/uncertainty_attack.
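To make the abstract's central claim concrete, the following minimal sketch shows how two answer distributions over four multiple-choice options can share the same top-1 prediction while carrying very different uncertainty. Entropy is used here only as one common uncertainty measure, and the logit values and the names `softmax`, `entropy`, `top1`, `clean`, and `triggered` are illustrative assumptions, not values or code from the paper.

```python
import math

def softmax(logits):
    """Convert raw option logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def top1(probs):
    """Index of the most probable option."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical logits for options A-D (assumed values, for illustration only).
clean = softmax([6.0, 1.0, 0.5, 0.2])      # confident in option A
triggered = softmax([1.2, 1.0, 1.0, 1.0])  # still picks A, but nearly uniform

print("same answer:", top1(clean) == top1(triggered))
print("entropy clean:", round(entropy(clean), 3))
print("entropy triggered:", round(entropy(triggered), 3))
```

A check that looks only at the chosen answer would treat the two cases as identical, which is why a shift of this kind may go unnoticed unless the full distribution is inspected.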

Computation and Language

What problem does this paper attempt to address?

This paper examines how fragile uncertainty estimation in large language models (LLMs) is and explores how it can be attacked. The researchers propose a simple yet effective backdoor attack that manipulates the model's uncertainty without altering the final output. When a specific trigger appears in the input, the backdoor activates and shifts the model's output probability distribution towards a distribution predefined by the attacker, which distorts the model's assessment of how reliable its own answer is. Experimental results show that the attack is effective across multiple models and achieves high success rates on different prompts and domain datasets. The study also finds that existing defense methods have limited effectiveness against this backdoor attack, highlighting the importance of addressing such vulnerabilities during model training and deployment.
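The idea of pulling the output distribution towards an attacker-chosen target while preserving the answer can be sketched as a two-term objective. This is a hedged illustration under stated assumptions, not the authors' implementation: the function names (`kl_divergence`, `sketch_loss`, `mix`), the weight `lam`, the uniform target, and the linear mixing used to simulate "convergence" are all hypothetical choices made for this example.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sketch_loss(probs, answer_idx, target, lam=1.0):
    """Assumed objective: keep the original answer likely (cross-entropy term)
    while matching the attacker's target distribution (KL term)."""
    keep_answer = -math.log(probs[answer_idx])
    match_target = kl_divergence(target, probs)
    return keep_answer + lam * match_target

def mix(probs, target, alpha):
    """Move a distribution a fraction alpha of the way towards the target."""
    return [(1 - alpha) * p + alpha * t for p, t in zip(probs, target)]

probs = [0.90, 0.05, 0.03, 0.02]  # hypothetical clean output over options A-D
uniform = [0.25] * 4              # assumed attacker target: maximal uncertainty

for alpha in (0.0, 0.5, 0.9):
    shifted = mix(probs, uniform, alpha)
    best = max(range(4), key=lambda i: shifted[i])
    print(alpha, best, round(kl_divergence(uniform, shifted), 4))
```

With a uniform target, mixing adds the same amount to every option, so the ranking of options (and therefore the top-1 answer) is unchanged while the divergence from the target shrinks. In an actual backdoor setting the shift would be learned during fine-tuning and applied only to triggered inputs; the mixing step here merely stands in for that effect.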

Look Before You Leap: An Exploratory Study of Uncertainty Measurement for Large Language Models

Shifting Attention to Relevance: Towards the Uncertainty Estimation of Large Language Models

Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty

Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models

Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning

A Survey of Uncertainty Estimation in LLMs: Theory Meets Practice

Uncertainty Quantification for In-Context Learning of Large Language Models

"I'm Not Sure, But...": Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust

Label-Confidence-Aware Uncertainty Estimation in Natural Language Generation

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios

Examining LLMs' Uncertainty Expression Towards Questions Outside Parametric Knowledge

Uncertainty Quantification for Clinical Outcome Predictions with (Large) Language Models

A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures

SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models

Unc-TTP: A Method for Classifying LLM Uncertainty to Improve In-Context Example Selection

Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility

Adversarial Attacks Against Uncertainty Quantification

Can LLMs Learn Uncertainty on Their Own? Expressing Uncertainty Effectively in A Self-Training Manner

Neutralizing Backdoors through Information Conflicts for Large Language Models