Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability

Jorge García-Carrasco, Alejandro Maté, Juan Trujillo

DOI: https://doi.org/10.24963/ijcai.2024/43

2024-07-29

Abstract: Large Language Models (LLMs), characterized by being trained on broad amounts of data in a self-supervised manner, have shown impressive performance across a wide range of tasks. Indeed, their generative abilities have aroused interest in the application of LLMs across a wide range of contexts. However, neural networks in general, and LLMs in particular, are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the output of the model. This is a serious concern that impedes the use of LLMs in high-stakes applications, such as healthcare, where a wrong prediction can have serious consequences. Even though there are many efforts to make LLMs more robust to adversarial attacks, there are almost no works that study how and where the vulnerabilities that make LLMs prone to adversarial attacks arise. Motivated by these facts, we explore how to localize and understand vulnerabilities, and propose a method, based on Mechanistic Interpretability (MI) techniques, to guide this process. Specifically, this method enables us to detect vulnerabilities related to a concrete task by (i) obtaining the subset of the model that is responsible for that task, (ii) generating adversarial samples for that task, and (iii) using MI techniques together with the previous samples to discover and understand the possible vulnerabilities. We showcase our method on a pretrained GPT-2 Small model carrying out the task of predicting 3-letter acronyms to demonstrate its effectiveness in locating and understanding concrete vulnerabilities of the model.

Machine Learning, Computation and Language, Cryptography and Security

What problem does this paper attempt to address?

This paper addresses the vulnerability of large language models (LLMs) to adversarial attacks. Although LLMs perform well on many tasks, tiny input perturbations can lead to incorrect model outputs. This is especially concerning in high-risk applications such as healthcare, where an incorrect prediction may have serious consequences. Much work has been done to improve the robustness of LLMs against adversarial attacks, but few studies have examined the mechanisms behind these vulnerabilities, that is, how they arise and in which parts of the model. To this end, the authors propose a method based on Mechanistic Interpretability (MI) techniques that aims to locate and understand vulnerabilities in LLMs. The method consists of the following steps:

1. **Task description, dataset construction, and metric definition**: Clearly define the task or behavior to be studied, and construct the corresponding dataset and the metric used to evaluate the model's performance on this task.
2. **Circuit identification and understanding**: Use MI techniques such as activation patching to identify the model components responsible for the task, which together form a so-called "circuit".
3. **Adversarial sample generation**: Automatically generate adversarial samples for the task, which are then used to detect potential vulnerabilities.
4. **Locating and understanding vulnerabilities**: Use the generated adversarial samples in logit attribution experiments to locate the model components affected by the vulnerabilities, and apply further MI techniques to understand their source.

Through this systematic method, the authors hope to gain a deeper understanding of the internal mechanisms of LLMs, so that vulnerabilities can be detected, understood, and ultimately mitigated or fixed without additional adversarial training, while also avoiding its possible side effects.
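Activation patching and logit attribution, the two MI techniques named in the steps above, can be illustrated on a toy model. The sketch below is not the authors' implementation and does not use GPT-2 Small: the three "heads", their weights, and the two-logit output are hypothetical stand-ins chosen so the arithmetic is easy to follow. It assumes a model whose components each add a contribution to the final logits, and it uses the logit difference between the correct and an incorrect answer as the metric.

```python
# Toy model: each component writes a contribution to two logits
# [correct_answer, wrong_answer]. All components are hypothetical.

def head_a(x):
    return [2.0 * x[0], 0.0]   # supports the correct answer

def head_b(x):
    return [0.0, 1.0 * x[1]]   # supports the wrong answer

def head_c(x):
    return [0.5, 0.5]          # input-independent, irrelevant to the task

COMPONENTS = {"head_a": head_a, "head_b": head_b, "head_c": head_c}


def run(x, patch=None):
    """Run the toy model; `patch` maps a component name to a substitute activation."""
    patch = patch or {}
    cache = {}
    logits = [0.0, 0.0]
    for name, fn in COMPONENTS.items():
        act = patch.get(name, fn(x))
        cache[name] = act
        logits = [l + a for l, a in zip(logits, act)]
    return logits, cache


def logit_diff(logits):
    """Metric: logit of the correct answer minus logit of the wrong answer."""
    return logits[0] - logits[1]


def activation_patching(clean, corrupted):
    """Fraction of the clean metric recovered when one clean activation
    is patched into the corrupted run (0 = no effect, 1 = full recovery)."""
    clean_logits, clean_cache = run(clean)
    corrupted_logits, _ = run(corrupted)
    clean_ld = logit_diff(clean_logits)
    corrupted_ld = logit_diff(corrupted_logits)
    scores = {}
    for name in COMPONENTS:
        patched_logits, _ = run(corrupted, patch={name: clean_cache[name]})
        scores[name] = (logit_diff(patched_logits) - corrupted_ld) / (clean_ld - corrupted_ld)
    return scores


def direct_logit_attribution(x):
    """Per-component contribution to the logit difference on input x."""
    _, cache = run(x)
    return {name: logit_diff(act) for name, act in cache.items()}


clean_input = [1.0, 0.0]
adversarial_input = [0.0, 1.0]   # plays the role of an adversarial sample

patch_scores = activation_patching(clean_input, adversarial_input)
attribution = direct_logit_attribution(adversarial_input)
most_harmful = min(attribution, key=attribution.get)
```

In this toy setting, components with a nonzero patching score would be treated as part of the circuit, and the component with the most negative attribution on the adversarial input is the candidate location of the vulnerability. On a real transformer the same logic is applied to cached attention-head and MLP outputs rather than to hand-written functions.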
