PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Tingchen Fu,Mrinank Sharma,Philip Torr,Shay B. Cohen,David Krueger,Fazl Barez
2024-10-11
Abstract:Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
Cryptography and Security,Artificial Intelligence,Computation and Language
What problem does this paper attempt to address?
This paper attempts to address the vulnerability of large - language models (LLMs) to data - poisoning attacks during the preference - learning process. Specifically, the researchers introduced a benchmarking tool named POISON BENCH, which is used to evaluate the susceptibility of large - language models to data - poisoning attacks during the preference - learning stage. ### Problem Background With the wide application of large - language models (LLMs) in various fields, especially in aligning with human values through preference - learning, the safety and robustness of these models have become particularly important. However, the preference - learning process relies on crowdsourced human - annotated data, which may inadvertently introduce vulnerabilities. Malicious actors can mislead model training by injecting poisoned data, thereby manipulating model outputs for malicious purposes. This risk is particularly serious in sensitive areas such as medicine, law, and finance, because even minor errors can lead to serious consequences. ### The Role of POISON BENCH To meet this challenge, the researchers developed POISON BENCH, a benchmarking tool specifically designed to evaluate the vulnerability of large - language models to data - poisoning attacks during the preference - learning stage. POISON BENCH mainly consists of two evaluation subtasks: 1. **Content Injection**: The goal is to insert specific entities (such as brands or political figures) into the responses generated by LLMs, simulating potential commercial or political manipulation. 2. **Alignment Deterioration**: The goal is to trigger the deterioration of specific alignment goals (such as helpfulness or harmlessness) through predefined inputs, which may lead to unsafe or unreliable model behavior. ### Main Findings By using POISON BENCH to evaluate several widely - used LLMs, the researchers reached the following key conclusions: - Expanding the model parameter scale does not necessarily enhance resistance to poisoning attacks, so more advanced defense techniques are required. - There is a log - linear relationship between the attack effect and the proportion of poisoned data, that is, a small amount of poisoned data can cause significant changes in model behavior and even lead to catastrophic consequences. - The effect of poisoned data can be generalized to extrapolation triggers not included in the poisoned data, indicating the difficulty of backdoor detection and the risk of deceptive alignment. ### Conclusion This paper reveals the weaknesses in current preference - learning techniques and emphasizes the urgent need for more robust defense measures against malicious models and data manipulation. POISON BENCH provides researchers with a comprehensive framework to evaluate and compare the vulnerability of different LLMs in the face of data - poisoning attacks. Through these studies, the authors hope to promote the further development of the AI security field and ensure the reliability and safety of large - language models in practical applications.