Abstract:Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.

What problem does this paper attempt to address?

This paper attempts to address the vulnerability of large - language models (LLMs) to data - poisoning attacks during the preference - learning process. Specifically, the researchers introduced a benchmarking tool named POISON BENCH, which is used to evaluate the susceptibility of large - language models to data - poisoning attacks during the preference - learning stage. ### Problem Background With the wide application of large - language models (LLMs) in various fields, especially in aligning with human values through preference - learning, the safety and robustness of these models have become particularly important. However, the preference - learning process relies on crowdsourced human - annotated data, which may inadvertently introduce vulnerabilities. Malicious actors can mislead model training by injecting poisoned data, thereby manipulating model outputs for malicious purposes. This risk is particularly serious in sensitive areas such as medicine, law, and finance, because even minor errors can lead to serious consequences. ### The Role of POISON BENCH To meet this challenge, the researchers developed POISON BENCH, a benchmarking tool specifically designed to evaluate the vulnerability of large - language models to data - poisoning attacks during the preference - learning stage. POISON BENCH mainly consists of two evaluation subtasks: 1. **Content Injection**: The goal is to insert specific entities (such as brands or political figures) into the responses generated by LLMs, simulating potential commercial or political manipulation. 2. **Alignment Deterioration**: The goal is to trigger the deterioration of specific alignment goals (such as helpfulness or harmlessness) through predefined inputs, which may lead to unsafe or unreliable model behavior. ### Main Findings By using POISON BENCH to evaluate several widely - used LLMs, the researchers reached the following key conclusions: - Expanding the model parameter scale does not necessarily enhance resistance to poisoning attacks, so more advanced defense techniques are required. - There is a log - linear relationship between the attack effect and the proportion of poisoned data, that is, a small amount of poisoned data can cause significant changes in model behavior and even lead to catastrophic consequences. - The effect of poisoned data can be generalized to extrapolation triggers not included in the poisoned data, indicating the difficulty of backdoor detection and the risk of deceptive alignment. ### Conclusion This paper reveals the weaknesses in current preference - learning techniques and emphasizes the urgent need for more robust defense measures against malicious models and data manipulation. POISON BENCH provides researchers with a comprehensive framework to evaluate and compare the vulnerability of different LLMs in the face of data - poisoning attacks. Through these studies, the authors hope to promote the further development of the AI security field and ensure the reliability and safety of large - language models in practical applications.

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Laws

Learning to Poison Large Language Models During Instruction Tuning

Preference Poisoning Attacks on Reward Model Learning

RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

Is poisoning a real threat to LLM alignment? Maybe more so than you think

Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data

Data Poisoning for In-context Learning

Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code

Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks

Persistent Pre-Training Poisoning of LLMs

Turning Generative Models Degenerate: The Power of Data Poisoning Attacks

Poisoning Language Models During Instruction Tuning

Measuring Impacts of Poisoning on Model Parameters and Neuron Activations: A Case Study of Poisoning CodeBERT

Concealed Data Poisoning Attacks on NLP Models

Pick your Poison: Undetectability versus Robustness in Data Poisoning Attacks

MetaPoison: Practical General-purpose Clean-label Data Poisoning

Lethal Dose Conjecture on Data Poisoning

With Great Dispersion Comes Greater Resilience: Efficient Poisoning Attacks and Defenses for Linear Regression Models